Here’s my website, queen

About me:

Personal links/code contributions:

GitHub

Zenodo release



Data sets I have actually worked with:

Here is a glimpse of a real-life COVID-19 data set that I scraped from the CDC and cleaned:
province_state  last_update  confirmed  deaths  recovered  active  case_fatality_ratio
Alabama         2021-01-02      365747    4872     202137  158738            1.3320683
Alaska          2021-01-02       47019     206       7165   39648            0.4381208
Arizona         2021-01-02      530267    9015      76934  444318            1.7000869
Arkansas        2021-01-02      229442    3711     199247   26484            1.6174022
California      2021-01-02     2436449   26168         NA      NA            1.0736043
Colorado        2021-01-02      337161    4873      18102  314186            1.4453036
We can rank states by their maximum case fatality ratio.
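A ranking like this can be sketched in a few lines of base R. This is only a minimal illustration, assuming a data frame with the column names shown in the table above; the object names covid and max_cfr are made up for the example, and the toy data holds just the six rows printed here (one date per state), whereas the full data set would have many dates per state.

```r
# Toy data: the six rows shown in the table above (illustration only).
covid <- data.frame(
  province_state = c("Alabama", "Alaska", "Arizona",
                     "Arkansas", "California", "Colorado"),
  last_update = as.Date("2021-01-02"),
  case_fatality_ratio = c(1.3320683, 0.4381208, 1.7000869,
                          1.6174022, 1.0736043, 1.4453036),
  stringsAsFactors = FALSE
)

# Take the maximum ratio observed for each state.
max_cfr <- aggregate(case_fatality_ratio ~ province_state,
                     data = covid, FUN = max)

# Sort from highest to lowest and attach a rank column.
max_cfr <- max_cfr[order(-max_cfr$case_fatality_ratio), ]
max_cfr$rank <- seq_len(nrow(max_cfr))

print(max_cfr, row.names = FALSE)
```

With several dates per state, the same aggregate() call would collapse them to one row per state before ranking.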
We can also show the cumulative death toll over time (this plot is interactive if you hover your cursor over the line).

Analysis of a data set regarding faculty salaries
## # A tibble: 6 × 17
##   FedID UnivName     State Tier  AvgFu…¹ AvgAs…² AvgAs…³ AvgPr…⁴ AvgFu…⁵ AvgAs…⁶
##   <dbl> <chr>        <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  1061 Alaska Paci… AK    IIB       454     382     362     382     567     485
## 2  1063 Univ.Alaska… AK    I         686     560     432     508     914     753
## 3  1065 Univ.Alaska… AK    IIA       533     494     329     415     716     663
## 4 11462 Univ.Alaska… AK    IIA       612     507     414     498     825     681
## 5  1002 Alabama Agr… AL    IIA       442     369     310     350     530     444
## 6  1004 University … AL    IIA       441     385     310     388     542     473
## # … with 7 more variables: AvgAssistProfComp <dbl>, AvgProfCompAll <dbl>,
## #   NumFullProfs <dbl>, NumAssocProfs <dbl>, NumAssistProfs <dbl>,
## #   NumInstructors <dbl>, NumFacultyAll <dbl>, and abbreviated variable names
## #   ¹​AvgFullProfSalary, ²​AvgAssocProfSalary, ³​AvgAssistProfSalary,
## #   ⁴​AvgProfSalaryAll, ⁵​AvgFullProfComp, ⁶​AvgAssocProfComp
This is not a “tidy” data set, so I can clean it by writing a reusable function that can be applied over and over.
## # A tibble: 6 × 14
##   fed_id univ_name      state tier  avg_p…¹ avg_p…² num_i…³ num_f…⁴ rank  salary
##    <dbl> <chr>          <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <dbl>
## 1   1061 Alaska Pacifi… AK    IIB       382     487       4      32 full…    454
## 2   1061 Alaska Pacifi… AK    IIB       382     487       4      32 full…    454
## 3   1061 Alaska Pacifi… AK    IIB       382     487       4      32 full…    454
## 4   1061 Alaska Pacifi… AK    IIB       382     487       4      32 full…    454
## 5   1061 Alaska Pacifi… AK    IIB       382     487       4      32 full…    454
## 6   1061 Alaska Pacifi… AK    IIB       382     487       4      32 full…    454
## # … with 4 more variables: comp_type <chr>, comp_amt <dbl>, faculty_type <chr>,
## #   faculty_count <dbl>, and abbreviated variable names ¹​avg_prof_salary_all,
## #   ²​avg_prof_comp_all, ³​num_instructors, ⁴​num_faculty_all
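To give a feel for what a reusable cleaning function can look like, here is a minimal base R sketch. It is not the function that produced the output above: tidy_salaries is a hypothetical name, it only reshapes the three salary columns into rank/salary pairs, and the university names in the toy input are placeholders because the real names are truncated in the printout.

```r
# Toy input: two rows using salary columns from the raw table.
# UnivName values are placeholders, not the real (truncated) names.
raw <- data.frame(
  FedID = c(1061, 1063),
  UnivName = c("University A", "University B"),
  State = c("AK", "AK"),
  Tier = c("IIB", "I"),
  AvgFullProfSalary = c(454, 686),
  AvgAssocProfSalary = c(382, 560),
  AvgAssistProfSalary = c(362, 432),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per university and faculty rank.
tidy_salaries <- function(df) {
  salary_cols <- c(full = "AvgFullProfSalary",
                   assoc = "AvgAssocProfSalary",
                   assist = "AvgAssistProfSalary")
  id_cols <- c("FedID", "UnivName", "State", "Tier")

  # Build one long piece per rank, then stack the pieces.
  pieces <- lapply(names(salary_cols), function(r) {
    out <- df[, id_cols, drop = FALSE]
    out$rank <- r
    out$salary <- df[[salary_cols[[r]]]]
    out
  })
  long <- do.call(rbind, pieces)

  # Snake-case names, matching the style of the cleaned output.
  names(long) <- c("fed_id", "univ_name", "state", "tier", "rank", "salary")
  long[order(long$fed_id, long$rank), ]
}

long <- tidy_salaries(raw)
print(long)
```

Because the reshaping lives in a function, the same call works on any file that shares these column names.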
Real-life data sets can be pretty hard to understand. It is my job to make them manageable to look at and informative!

The most powerful tool a data analyst has (besides visualizing data) is modeling data to make predictions.
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## state          51  51141561  1002776   280.1 <2e-16 ***
## rank            2 151020520 75510260 21089.7 <2e-16 ***
## tier            2  61398662 30699331  8574.2 <2e-16 ***
## Residuals   30139 107910649     3580                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1152 observations deleted due to missingness
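This is an ANOVA (analysis of variance) table. ANOVA is a linear modeling method that evaluates how much of the variation in an outcome is associated with each explanatory variable. Comparing the sums of squares and F values gives a rough ranking of the variables by their impact on the outcome; here, rank accounts for the most variation, followed by tier and then state. We can use tools like this to identify variables worth exploring when making changes to our experiments or workflow, or when making predictions for the future.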
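For anyone curious how a table like this is produced, here is a small sketch using aov() from base R. The data are synthetic and generated only for illustration, and using salary as the outcome is my assumption for the example; the toy values are built so that rank drives most of the variation.

```r
set.seed(42)  # make the synthetic data reproducible
n <- 300

# Synthetic predictors (illustration only, not the real faculty data).
toy <- data.frame(
  state = factor(sample(c("AK", "AL", "AZ"), n, replace = TRUE)),
  rank  = factor(sample(c("assist", "assoc", "full"), n, replace = TRUE)),
  tier  = factor(sample(c("I", "IIA", "IIB"), n, replace = TRUE))
)

# Synthetic outcome: only rank shifts the mean, plus random noise.
rank_effect <- c(assist = 0, assoc = 80, full = 200)
toy$salary <- 400 + unname(rank_effect[as.character(toy$rank)]) +
  rnorm(n, sd = 30)

# Fit the ANOVA model and pull out the summary table.
fit <- aov(salary ~ state + rank + tier, data = toy)
anova_table <- summary(fit)[[1]]
print(anova_table)
```

One caveat: aov() reports sequential sums of squares, so with unbalanced data the order of the terms in the formula can change the numbers in the table.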
ANOVA is just one modeling method. Many others are readily available in R and RStudio. It is my job to use objective, data-driven criteria to select the model that best fits each unique data set.
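One objective way to compare candidate models is an information criterion such as AIC, where lower values indicate a better trade-off between fit and complexity. The sketch below is only an illustration on synthetic data; the candidate models and the names toy, candidates, and aic_scores are made up for the example, and AIC is one of several reasonable criteria.

```r
set.seed(1)  # make the synthetic data reproducible
n <- 200

# Synthetic data with a truly linear relationship (illustration only).
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1)
toy <- data.frame(x = x, y = y)

# Three candidate models of increasing complexity.
candidates <- list(
  intercept_only = lm(y ~ 1, data = toy),
  linear         = lm(y ~ x, data = toy),
  quadratic      = lm(y ~ poly(x, 2), data = toy)
)

# Lower AIC means a better balance of fit and complexity.
aic_scores <- sapply(candidates, AIC)
best <- names(which.min(aic_scores))

print(sort(aic_scores))
print(best)
```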